Conversation

400Ping (Contributor) commented Nov 27, 2025

Why are these changes needed?

In this PR, we fail the RayJob when its head Pod is deleted after the RayCluster has been provisioned. However, applying this to every mode would be a breaking change, so it should not cover the other submission modes such as K8sJobMode or HTTPMode.
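
To make the scope concrete, here is a minimal sketch of the intended gating, with hypothetical names (rayJob and headPodDeleted are placeholders, and rayv1.SidecarMode follows the naming of the other modes mentioned above; this is not the actual controller code):

// Sketch only: fail the RayJob on head Pod deletion exclusively in
// sidecar mode, leaving K8sJobMode and HTTPMode behavior unchanged.
// rayJob and headPodDeleted are hypothetical placeholders.
if rayJob.Spec.SubmissionMode == rayv1.SidecarMode && headPodDeleted {
    rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
    rayJob.Status.Reason = rayv1.AppFailed
    rayJob.Status.Message = "Head Pod was deleted after the RayCluster was provisioned."
}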

Related issue number

Closes #4176

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: 400Ping <[email protected]>
400Ping (Contributor, Author) commented Nov 28, 2025

cc @Future-Outlier @rueian PTAL

400Ping marked this pull request as draft November 29, 2025 07:34
Future-Outlier (Member) left a comment

Hi @400Ping,
let us know if you need any help; please ping us.

400Ping (Contributor, Author) commented Dec 15, 2025

Ok, will do, thanks.

Signed-off-by: 400Ping <[email protected]>
400Ping marked this pull request as ready for review December 15, 2025 16:03
Signed-off-by: 400Ping <[email protected]>
400Ping force-pushed the bug/sidecar-mode-fix branch from b8077b9 to c4bfd24 on December 18, 2025 11:37
    return len(pods.Items)
}, TestTimeoutMedium, 2*time.Second).Should(Equal(1))
g.Consistently(RayJob(test, rayJob.Namespace, rayJob.Name), TestTimeoutShort).
    ShouldNot(WithTransform(RayJobDeploymentStatus, Equal(rayv1.JobDeploymentStatusFailed)))
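
Aside: the excerpt above uses Gomega's Eventually/Consistently pattern: poll until a condition holds, then assert it keeps holding for a period. A self-contained sketch of the same pattern (countPods is a hypothetical stand-in for the pod-listing closure above, and the timeouts are arbitrary):

package e2e

import (
    "testing"
    "time"

    . "github.com/onsi/gomega"
)

// countPods is a hypothetical stand-in for listing Pods with a
// Kubernetes client and returning len(pods.Items).
func countPods() int {
    return 1
}

func TestPodCountIsStable(t *testing.T) {
    g := NewWithT(t)

    // Poll every 2s, for up to 1 minute, until exactly one Pod exists.
    g.Eventually(countPods, time.Minute, 2*time.Second).Should(Equal(1))

    // Then assert the count does not change for 30s, mirroring the
    // ShouldNot(... JobDeploymentStatusFailed) assertion above.
    g.Consistently(countPods, 30*time.Second).Should(Equal(1))
}
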
Collaborator left a comment

CI keeps failing (screenshot attached).

Collaborator left a comment

I think when the head pod is deleted in the default K8sJobMode, we will get into the following block and the job deployment status will still be Failed. I don't know if this is the expected behavior?

if finishedAt != nil {
    rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
    rayJobInstance.Status.Reason = rayv1.AppFailed
    rayJobInstance.Status.Message = "Submitter completed but Ray job not found in RayCluster."
    break
}

win5923 (Collaborator) commented Dec 27, 2025

I think I now understand why this logic is written this way. We currently have two scenarios:

  1. The RayJob loses its head pod before the ray job is submitted.
  2. The RayJob loses its head pod after the ray job has already been submitted.

For scenario 1:

When the head Pod is deleted while the submitter Pod is attempting to submit the ray job, the submitter encounters a WebSocket connection failure and exits with a non-zero exit code. The K8s Job then creates a new submitter Pod to retry (the submitterBackoffLimit default is 2).

func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *rayv1.RayJob, submitterTemplate corev1.PodTemplateSpec) error {
    logger := ctrl.LoggerFrom(ctx)
    submitterBackoffLimit := ptr.To[int32](2)
    // ...
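
For reference, here is roughly where that limit lands in the submitter Job spec; a minimal sketch with placeholder metadata (buildSubmitterJob and the object name are illustrative, not the actual kuberay code):

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/utils/ptr"
)

// buildSubmitterJob sketches how a backoff limit of 2 yields up to three
// submitter Pod attempts (one initial run plus two retries) before the
// K8s Job itself is marked failed.
func buildSubmitterJob(template corev1.PodTemplateSpec) *batchv1.Job {
    return &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{Name: "rayjob-submitter"}, // placeholder name
        Spec: batchv1.JobSpec{
            BackoffLimit: ptr.To[int32](2), // retry the submitter Pod twice
            Template:     template,
        },
    }
}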

Meanwhile, the RayCluster controller recreates the head Pod and worker Pods (only sidecar mode skips head Pod restart).

If the new head Pod becomes ready before the K8s Job reaches its backoffLimit, the new submitter Pod can successfully submit the ray job, and the RayJob can eventually transition to JobDeploymentStatusComplete with JobStatus = JobStatusSucceeded. But this requires good luck.

For scenario 2:

When the head Pod is deleted after the ray job has already been submitted, the K8s Job is marked as Completed because the Pod exits with exit code 0, even though the ray job itself has not completed.

Then, when the controller tries to query the ray job status via GetJobInfo, it returns BadRequest (job not found). The RayJob will transition to JobDeploymentStatusFailed with JobStatus = JobStatusFailed. (This was fixed by #3860.)

jobInfo, err := rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId)
if err != nil {
    // If the Ray job was not found, GetJobInfo returns a BadRequest error.
    if errors.IsBadRequest(err) {
        if rayJobInstance.Spec.SubmissionMode == rayv1.HTTPMode {
            logger.Info("The Ray job was not found. Submit a Ray job via an HTTP request.", "JobId", rayJobInstance.Status.JobId)
            if _, err := rayDashboardClient.SubmitJob(ctx, rayJobInstance); err != nil {
                logger.Error(err, "Failed to submit the Ray job", "JobId", rayJobInstance.Status.JobId)
                return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
            }
            return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
        }
        // finishedAt will only be set if the submitter finished.
        if finishedAt != nil {
            rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
            rayJobInstance.Status.Reason = rayv1.AppFailed
            rayJobInstance.Status.Message = "Submitter completed but Ray job not found in RayCluster."
            break
        }
    }
}

If GetJobInfo does not return BadRequest, the RayJob instead goes to JobDeploymentStatusTransitionGracePeriodExceeded:

func checkSubmitterFinishedTimeoutAndUpdateStatusIfNeeded(ctx context.Context, rayJob *rayv1.RayJob, finishedAt *time.Time) bool {
    logger := ctrl.LoggerFrom(ctx)
    // Check if timeout is configured and the submitter has finished.
    if finishedAt == nil {
        return false
    }
    // Check if the timeout has been exceeded.
    if time.Now().Before(finishedAt.Add(DefaultSubmitterFinishedTimeout)) {
        return false
    }
    logger.Info("The RayJob has passed the submitterFinishedTimeoutSeconds. Transition the status to terminal.",
        "SubmitterFinishedTime", finishedAt,
        "SubmitterFinishedTimeoutSeconds", DefaultSubmitterFinishedTimeout.String())
    rayJob.Status.JobStatus = rayv1.JobStatusFailed
    rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
    rayJob.Status.Reason = rayv1.JobDeploymentStatusTransitionGracePeriodExceeded
    rayJob.Status.Message = fmt.Sprintf("The RayJob submitter finished at %v but the ray job did not reach terminal state within %v",
        finishedAt.Format(time.DateTime), DefaultSubmitterFinishedTimeout)
    return true
}
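
A sketch of how a reconcile pass might invoke this check (the surrounding names, r, rayJobInstance, and the source of finishedAt, are assumptions about the caller, not the actual controller code):

// finishedAt would come from the submitter K8s Job's completion time.
if checkSubmitterFinishedTimeoutAndUpdateStatusIfNeeded(ctx, rayJobInstance, finishedAt) {
    // The status was moved to a terminal Failed state; persist it and
    // stop requeueing.
    if err := r.Status().Update(ctx, rayJobInstance); err != nil {
        return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
    }
    return ctrl.Result{}, nil
}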

win5923 (Collaborator) commented Dec 27, 2025

As Kai-Hsun mentioned in Slack, I think we can simply remove this line, since the behavior in scenario 1 is not reliable.

I think it is pretty rare; typically the Job will hit its backoffLimit before the new Pod is ready, and the whole cluster needs to restart. There is a lot of potential for unexpected behavior here. Users should not rely on it.

https://ray.slack.com/archives/C07M825T94Z/p1762216441714309?thread_ts=1761694399.867989&cid=C07M825T94Z

Signed-off-by: Future-Outlier <[email protected]>
Future-Outlier (Member) left a comment

Hi @400Ping, do you mind resolving the comments?

Future-Outlier moved this from can be merged to In Progress in @Future-Outlier's kuberay project Jan 7, 2026
Signed-off-by: 400Ping <[email protected]>
400Ping (Contributor, Author) commented Jan 10, 2026

> Hi @400Ping, do you mind resolving the comments?

Done, PTAL.

win5923 (Collaborator) left a comment

LGTM

400Ping force-pushed the bug/sidecar-mode-fix branch from 556907f to c0d916d on January 11, 2026 10:59
Signed-off-by: 400Ping <[email protected]>
Signed-off-by: 400Ping <[email protected]>
Future-Outlier moved this from In Progress to can be merged in @Future-Outlier's kuberay project Jan 12, 2026
rueian merged commit fdc0e9a into ray-project:master Jan 14, 2026
29 checks passed
github-project-automation bot moved this from can be merged to Done in @Future-Outlier's kuberay project Jan 14, 2026

Development

Successfully merging this pull request may close these issues:

[Bug][RayJob] Sidecar mode shouldn't restart head pod when head pod is deleted